{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "364c3467",
   "metadata": {},
   "source": [
    "# Question Answering System\n",
    "In this example we will be going over the code used to build a question answering system. This example uses a modified BERT model to extract features from questions and Milvus to search for similar questions and answers. "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7ffbf466",
   "metadata": {},
   "source": [
    "## Data\n",
    "This example uses the [InsuranceQA Corpus](https://github.com/shuzi/insuranceQA) dataset, which contains 27,413 answers with the 3,065,492 running words of answers.\n",
    "\n",
    "Download location: https://github.com/chatopera/insuranceqa-corpus-zh/tree/release/corpus/pairs\n",
    "\n",
    "In this example, we use a small subset of the dataset that contains 100 pairs of quesiton-answers, it can be found under the **data** directory."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "49e47e44",
   "metadata": {},
   "source": [
    "## Requirements\n",
    "\n",
    "\n",
    "|  Packages   |  Servers    |\n",
    "|-                  | -                 |   \n",
    "| pymilvus          | milvus-1.1.0      |\n",
    "| sentence_transformers      | postgres          |\n",
    "| psycopg2          |\n",
    "| pandas           |\n",
    "| numpy   |\n",
    "\n",
    "We have included a `requirements.txt` file in order to easily satisfy the required packages. \n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3a9e53ff-7ee2-4960-be97-2a483b897e64",
   "metadata": {},
   "source": [
    "## Up and Running"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a932f169",
   "metadata": {},
   "source": [
    "### Installing Packages\n",
    "Install the required python packages with `requirements.txt`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7581227d",
   "metadata": {},
   "outputs": [],
   "source": [
    "pip install -r requirements.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1626a93f",
   "metadata": {},
   "source": [
    "### Starting Milvus Server\n",
    "\n",
    "This demo uses Milvus 1.1.0, please refer to the [Install Milvus](https://milvus.io/docs/v1.1.0/install_milvus.md) guide to learn how to use this docker container. For this example we wont be mapping any local volumes. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d8f40223",
   "metadata": {},
   "outputs": [],
   "source": [
    "! docker run --name milvus_cpu_1.1.0 -d \\\n",
    "-p 19530:19530 \\\n",
    "-p 19121:19121 \\\n",
    "milvusdb/milvus:1.1.0-cpu-d050721-5e559c"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5ef0fd9c",
   "metadata": {},
   "source": [
    "### Starting Postgres Server\n",
    "For now, Milvus doesn't support storing string data. Thus, we need a relational database to store questions and answers. In this example, we use [PostgreSQL](https://www.postgresql.org/)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "47f28d44-a5a9-420e-b881-12f4cbf86031",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "docker: Error response from daemon: Conflict. The container name \"/postgres\" is already in use by container \"244c288ae2317ab906211bda102ab7c849a7840bbd781480926fa571b9fdee79\". You have to remove (or rename) that container to be able to reuse that name.\n",
      "See 'docker run --help'.\n"
     ]
    }
   ],
   "source": [
    "! docker run --name postgres -d  -p 5432:5432 -e POSTGRES_HOST_AUTH_METHOD=trust postgres"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4d5607a4-c1ca-43ea-bba9-c3bf40a0029c",
   "metadata": {},
   "source": [
    "### Confirm Running Servers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "bee7b27f-4eec-40af-8871-66ea069ce805",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "    __  _________ _   ____  ______    \n",
      "   /  |/  /  _/ /| | / / / / / __/    \n",
      "  / /|_/ // // /_| |/ / /_/ /\\ \\    \n",
      " /_/  /_/___/____/___/\\____/___/     \n",
      "\n",
      "Welcome to use Milvus!\n",
      "Milvus Release version: v1.1.0, built at 2021-05-06 14:50.43, with OpenBLAS library.\n",
      "You are using Milvus CPU edition\n",
      "Last commit id: 5e559cd7918297bcdb55985b80567cb6278074dd\n",
      "\n",
      "Loading configuration from: /var/lib/milvus/conf/server_config.yaml\n",
      "WARNNING: You are using SQLite as the meta data management, which can't be used in production. Please change it to MySQL!\n",
      "Supported CPU instruction sets: avx2, sse4_2\n",
      "FAISS hook AVX2\n",
      "Milvus server started successfully!\n"
     ]
    }
   ],
   "source": [
    "! docker logs milvus_cpu_1.1.0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "ff170df0-e2ae-4a2c-9a96-4b97a04b4e3d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2021-05-20 19:03:54.900 UTC [1] LOG:  starting PostgreSQL 13.2 (Debian 13.2-1.pgdg100+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-6) 8.3.0, 64-bit\n",
      "2021-05-20 19:03:54.900 UTC [1] LOG:  listening on IPv4 address \"0.0.0.0\", port 5432\n",
      "2021-05-20 19:03:54.900 UTC [1] LOG:  listening on IPv6 address \"::\", port 5432\n",
      "2021-05-20 19:03:54.903 UTC [1] LOG:  listening on Unix socket \"/var/run/postgresql/.s.PGSQL.5432\"\n",
      "2021-05-20 19:03:54.907 UTC [68] LOG:  database system was shut down at 2021-05-20 19:03:54 UTC\n",
      "2021-05-20 19:03:54.911 UTC [1] LOG:  database system is ready to accept connections\n"
     ]
    }
   ],
   "source": [
    "! docker logs postgres --tail 6"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "40c4adb9",
   "metadata": {},
   "source": [
    "## Code Overview"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d56b1cfd-bb73-47e3-80a8-19d56dc6b8dd",
   "metadata": {},
   "source": [
    "### Connecting to Servers\n",
    "We first start off by connecting to the servers. In this case the docker containers are running on localhost and the ports are the default ports. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "53ec82a1",
   "metadata": {},
   "outputs": [],
   "source": [
    "#Connectings to Milvus, BERT and Postgresql\n",
    "import milvus\n",
    "import psycopg2\n",
    "\n",
    "milv = milvus.Milvus(host='localhost', port='19530')\n",
    "conn = psycopg2.connect(host='localhost', port='5432', user='postgres', password='postgres')\n",
    "cursor = conn.cursor()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cd775d94",
   "metadata": {},
   "source": [
    "### Creating Collection and Setting Index\n",
    "#### 1. Creating the Collection  \n",
    "A collection in Milvus is similar to a table in a relational database, and is used for storing all the vectors.  \n",
    "The required parameters for creating a collection are as follows:  \n",
    "- `collection_name`: the name of a collection.  \n",
    "- `dimension`: BERT generates 728-dimensional vectors.  \n",
    "- `index_file_size`: how large each data segment will be within the collection.      \n",
    "- `metric_type`: the distance formula being used to calculate similarity. In this example we are using Inner product (IP)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "0829a1ff",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Status(code=0, message='Create collection successfully!')\n"
     ]
    }
   ],
   "source": [
    "TABLE_NAME = 'question_answering'\n",
    "\n",
    "#Deleting previouslny stored table for clean run\n",
    "milv.drop_collection(TABLE_NAME)\n",
    "\n",
    "\n",
    "collection_param = {\n",
    "            'collection_name': TABLE_NAME,\n",
    "            'dimension': 768,\n",
    "            'index_file_size': 1024,  \n",
    "            'metric_type': milvus.MetricType.IP \n",
    "            }\n",
    "\n",
    "status = milv.create_collection(collection_param)\n",
    "print(status)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f9f601a9",
   "metadata": {},
   "source": [
    "#### 2. Setting an Index\n",
    "After creating the collection we want to assign it an index type. This can be done before or after inserting the data. When done before, indexes will be made as data comes in and fills the data segments. In this example we are using IVF_FLAT which requires the 'nlist' parameter. Each index types carries its own parameters. More info about this param can be found [here](https://milvus.io/docs/v1.1.0/index.md#CPU)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "id": "25e4eaa0",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Status(code=0, message='Build index successfully!')\n"
     ]
    }
   ],
   "source": [
    "param = {'nlist': 40}\n",
    "status = milv.create_index(TABLE_NAME, milvus.IndexType.IVF_FLAT, param)\n",
    "print(status)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0a53735f",
   "metadata": {},
   "source": [
    "### Creating Table in Postgres  \n",
    "PostgresSQL will be used to store the Milvus ID and its corresponding question-answer combo."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "id": "6174d378",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "create postgres table successfully!\n"
     ]
    }
   ],
   "source": [
    "#Deleting previouslny stored table for clean run\n",
    "drop_table = \"DROP TABLE IF EXISTS \" + TABLE_NAME\n",
    "cursor.execute(drop_table)\n",
    "conn.commit()\n",
    "\n",
    "try:\n",
    "    sql = \"CREATE TABLE if not exists \" + TABLE_NAME + \" (ids bigint, question text, answer text);\"\n",
    "    cursor.execute(sql)\n",
    "    conn.commit()\n",
    "    print(\"create postgres table successfully!\")\n",
    "except Exception as e:\n",
    "    print(\"can't create a postgres table: \", e)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ea03ccb7",
   "metadata": {},
   "source": [
    "### Processing and Storing QA Dataset\n",
    "#### 1. Generating Embeddings\n",
    "In this example we are using the sentence_transformer library  to encode the sentence into vectors. This library uses a modified BERT model to generate the embeddings, and in this example we are using a model pretrained using Microsoft's `mpnet`. More info can be found [here](https://www.sbert.net/docs/pretrained_models.html#sentence-embedding-models)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "id": "a3372f03",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "You try to use a model that was created with version 1.2.0, however, your version is 1.1.1. This might cause unexpected behavior or errors. In that case, try to update to the latest version.\n",
      "\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "from sentence_transformers import SentenceTransformer\n",
    "import pandas as pd\n",
    "from sklearn.preprocessing import normalize\n",
    "\n",
    "model = SentenceTransformer('paraphrase-mpnet-base-v2')\n",
    "\n",
    "# Get questions and answers.\n",
    "data = pd.read_csv('data/example.csv')\n",
    "question_data = data['question'].tolist()\n",
    "answer_data = data['answer'].tolist()\n",
    "\n",
    "sentence_embeddings = model.encode(question_data)\n",
    "sentence_embeddings = normalize(sentence_embeddings)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a87c089c",
   "metadata": {},
   "source": [
    "#### 2. Inserting Vectors into Milvus\n",
    "Since this example dataset contains only 100 vectors, we are inserting all of them as one batch insert."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "id": "a57b3864",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Status(code=0, message='Add vectors successfully!')\n"
     ]
    }
   ],
   "source": [
    "status, ids = milv.insert(collection_name=TABLE_NAME, records=sentence_embeddings)\n",
    "print(status)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0cb833d2",
   "metadata": {},
   "source": [
    "#### 3. Inserting IDs and Questions-answer Combos into PostgreSQL\n",
    "In order to transfer the data into Postgres, we are creating a new file that combines all the data into a readable format. Once created, we pass this file into the Postgress server through STDIN due to the Postgres container not having access to the file locally. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "id": "5776ac9e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Inserted into Postgress Sucessfully!\n"
     ]
    }
   ],
   "source": [
    "import os \n",
    "\n",
    "def record_temp_csv(fname, ids, answer, question):\n",
    "    with open(fname,'w') as f:\n",
    "        for i in range(len(ids)):\n",
    "            line = str(ids[i]) + \"|\" + question[i] + \"|\" + answer[i] + \"\\n\"\n",
    "            f.write(line)\n",
    "\n",
    "def copy_data_to_pg(table_name, fname, conn, cur):\n",
    "    fname = os.path.join(os.getcwd(),fname)\n",
    "    try:\n",
    "        sql = \"COPY \" + table_name + \" FROM STDIN DELIMITER '|' CSV HEADER\"\n",
    "        cursor.copy_expert(sql, open(fname, \"r\"))\n",
    "        conn.commit()\n",
    "        print(\"Inserted into Postgress Sucessfully!\")\n",
    "    except Exception as e:\n",
    "        print(\"Copy Data into Postgress failed: \", e)\n",
    "        \n",
    "DATA_WITH_IDS = 'data/test.csv'   \n",
    "\n",
    "record_temp_csv(DATA_WITH_IDS, ids, answer_data, question_data)\n",
    "copy_data_to_pg(TABLE_NAME, DATA_WITH_IDS, conn, cursor)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "be2d9599",
   "metadata": {},
   "source": [
    "### Search\n",
    "#### 1. Processing Query\n",
    "When searching for a question, we first put the question through the same model to generate an embedding. Then with that embedding vector we  can search for similar embeddings in Milvus.  \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "id": "d77a9db1",
   "metadata": {},
   "outputs": [],
   "source": [
    "SEARCH_PARAM = {'nprobe': 40}\n",
    "\n",
    "query_vec = []\n",
    "\n",
    "question = \"What is AAA?\"\n",
    "\n",
    "query_embeddings = []\n",
    "embed = model.encode(question)\n",
    "embed = embed.reshape(1,-1)\n",
    "embed = normalize(embed)\n",
    "query_embeddings = embed.tolist()\n",
    "\n",
    "\n",
    "status, results = milv.search(collection_name=TABLE_NAME, query_records=query_embeddings, top_k=5, params=SEARCH_PARAM)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "57d69d1b",
   "metadata": {},
   "source": [
    "#### 2. Getting the Similar Questions\n",
    "There may not have questions that are similar to the given one. So we can set a threshold value, here we use 0.5, and when the most similar distance retrieved is less than this value, a hint that the system doesn't include the relevant question is returned. We then use the result ID's to pull out the similar questions from the Postgres server and print them with their corresponding similarity score."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "id": "80b7c45b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "There are similar questions in the database, here are the closest matches: \n",
      "('What  Does  AAA  Home  Insurance  Cover?', 0.5728842616081238)\n",
      "('What  Does  Credit  Have  To  Do  With  Auto  Insurance?', 0.4042107164859772)\n",
      "('Is  Car  Insurance  Prepaid?', 0.31805452704429626)\n",
      "('Does  AARP  Have  Long  Term  Care  Insurance?', 0.30669423937797546)\n",
      "('Is  Car  Insurance  Credit  Checked?', 0.3045981228351593)\n"
     ]
    }
   ],
   "source": [
    "similar_questions = []\n",
    "\n",
    "if results[0][0].distance < 0.5:\n",
    "    print(\"There are no similar questions in the database, here are the closest matches:\")\n",
    "else:\n",
    "    print(\"There are similar questions in the database, here are the closest matches: \")\n",
    "    \n",
    "for result in results[0]:\n",
    "    sql = \"select question from \" + TABLE_NAME + \" where ids = \" + str(result.id) + \";\"\n",
    "    cursor.execute(sql)\n",
    "    rows=cursor.fetchall()\n",
    "    if len(rows):\n",
    "        similar_questions.append((rows[0][0], result.distance))\n",
    "        print((rows[0][0], result.distance))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9cbd7dfb",
   "metadata": {},
   "source": [
    "#### 3. Get the answer\n",
    "After getting a list of similar questions, choose the one that you feel is closest to yours. Then you can use that question to find the corresponding answer in Postgres."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "id": "fa504d72",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Question:\n",
      "What is AAA?\n",
      "Answer:\n",
      " AAA Home insurance, like all other major carriers, covers a wide variety of claims, including fire, theft, vandalism, and many other items. However, there are numerous types of policies offered, so it is best to determine the type of policy you have to accurately understand all of the benefits. An experienced broker can help.\n"
     ]
    }
   ],
   "source": [
    "sql = \"select answer from \" + TABLE_NAME + \" where question = '\" + similar_questions[0][0] + \"';\"\n",
    "cursor.execute(sql)\n",
    "rows=cursor.fetchall()\n",
    "print(\"Question:\")\n",
    "print(question)\n",
    "print(\"Answer:\")\n",
    "print(rows[0][0])"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
